Abstract

Peer to peer (P2P) systems are moving from applica- 
tion specific architectures to a generic service oriented de- 
sign philosophy. This raises interesting problems in con- 
nection with providing useful P2P middleware services ca- 
pable of dealing with resource assignment and management 
in a large-scale, heterogeneous and unreliable environment. 
The slicing service has been proposed to allow for an automatic
partitioning of P2P networks into groups (slices)
that represent a controllable amount of some resource and 
that are also relatively homogeneous with respect to that 
resource. In this paper we propose two gossip-based al- 
gorithms to solve the distributed slicing problem. The first 
algorithm speeds up an existing algorithm sorting a set of 
uniform random numbers. The second algorithm statisti- 
cally approximates the rank of nodes in the ordering. The 
scalability, efficiency and resilience to dynamics of both al- 
gorithms rely on their gossip-based models. These algo- 
rithms are proved viable theoretically and experimentally. 

Keywords: Slice, Gossip, Churn, Peer-to-Peer, Aggrega- 
tion, Large Scale. 

1. Introduction 

1.1. Context and Motivations 

The peer to peer (P2P) communication paradigm has 
now become the prevalent model to build large-scale dis- 
tributed applications. On the one hand, P2P protocols in- 
tegrate into platforms on top of which several applications, 
with various requirements, may cohabit. This leads to the 
interesting issue of resource assignment or how to allocate 
a set of nodes for a given application. Examples of appli- 
cations for such a service are telecommunication, testbed 
platform O, or desktop-grid-like applications (l2j. On the 



other hand, P2P systems should be able to balance the load 
taking into account that capabilities are heterogeneous at the 
peers [19, 4, 20]. This heterogeneity has some drawbacks.
Completely decentralized P2P applications, like the original
Gnutella [8], suffered from congestion when applied to
large-scale systems because nodes with a low bandwidth
capability were queried. Consequently, file sharing applications
[16, 9] tend to request ultrapeers/supernodes (peers
with larger lifetime and bandwidth capabilities), more often 
than regular peers. P2P protocols must identify efficiently 
and accurately peers with specific capabilities. 

Large scale dynamic distributed systems consist of many 
participants that can join and leave at will. Identifying peers 
in such systems that have a similar level of power or ca- 
pability (for instance, in terms of bandwidth, processing 
power, storage space, or uptime) in a completely decentral- 
ized manner is a difficult task. It is even harder to maintain 
this information in the presence of churn. Due to the intrin- 
sic dynamics of contemporary P2P systems it is impossi- 
ble to obtain accurate information about the capabilities (or 
even the identity) of the system participants. Consequently, 
no node is able to maintain accurate information about all 
the nodes. This disqualifies centralized approaches.

The slicing service [13] enables peers to self-organize
into a partitioning, where partitions (slices) are connected 
overlay networks that represent a given percentage of some 
resource. The slicing is ordered in the sense that peers are 
sorted according to their capabilities expressed by an at- 
tribute value. Building upon the work on ordered slicing 
of [13], here we focus on the issue of accurate slicing. That
is, we focus on improving quality by slicing the network 
accurately, and improving stability of slices by minimizing 
the impact of the churn. The distributed slicing problem 
we tackle in this paper consists in ranking nodes depending 
on their relative capability, slicing the network depending 
on these capabilities and, most importantly, readapting the 
slices continuously to cope with system dynamism.



1.2. Contributions 

The paper presents two distributed algorithms to slice the 
nodes according to their capability, reflected by an attribute 
value. These algorithms are robust and lightweight due to
their gossip-based communication pattern. The first algorithm
of the paper builds upon the ordered slicing algorithm
proposed in [13], which we call the JK algorithm in the sequel
of this paper. This algorithm speeds up the convergence of 
JK by locally computing a disorder measure so that a peer 
chooses the neighbor to communicate with in order to max- 
imize the chance of decreasing the global disorder measure. 

Then, we identify two issues that prevent accurate slicing 
and motivate us to find an alternative approach to this algo- 
rithm and JK. First, the slicing might be inaccurate. Ran- 
dom values are used to calculate which slice a node belongs 
to. The accuracy of the slicing fully depends on the uniformity
of the random value spread between 0 and 1 (e.g., the
proportion of random values between 0.8 and 1 should
ideally be 20% of the nodes). Second, the previous algorithms
suffer from churn and dynamism when correlated with the
attribute values. For example, if the peers are sorted accord- 
ing to their connectivity potential, a portion of the attribute 
space (and therefore the random value space) might be sud- 
denly affected. The consequence is to skew the distribution 
of random values towards high or low values. 

The second algorithm is an alternative algorithm solving 
these two issues by approximating locally the rank of the 
nodes, without using random values. The basic idea is that 
each node periodically estimates its rank along the attribute 
axis depending on the attributes it has seen so far. Based
on continuously aggregated information, the node can determine
the slice it belongs to with a decreasing error margin.
We show that this algorithm provides accurate estimation
and recovery ability in presence of attribute-correlated
churn at the price of a slower convergence. 

1.3. Outline 

The rest of the paper is organized as follows: Section 2
surveys some related work. The system model is presented
in Section 3. The first contribution, an improved ordered
slicing algorithm based on random values, is presented in
Section 4, and the second algorithm, based on dynamic ranking,
in Section 5. Section 6 concludes the paper.

2. Related Work 

Most of the solutions proposed so far for ordering nodes 
come from the context of databases [5, 11], where parallelizing
query executions is used to improve efficiency. A
large majority of the solutions in this area rely on central- 



ized gathering or all-to-all exchange, which makes them un- 
suitable for large-scale networks. 

Other related problems are the selection problem and the
$\varphi$-quantile search. The selection problem Q aims at determining
the $i$-th smallest element with as few comparisons
as possible. The $\varphi$-quantile search (with $\varphi \in (0, 1]$) is
the problem of finding among $n$ elements the $(\varphi n)$-th element.
Even though these problems look similar to our problem, 
they aim at finding a specific node among all, while the dis- 
tributed slicing problem aims at solving a global problem 
where each node maintains a piece of information. Addi- 
tionally, solutions to the quantile search problem like the 
one presented in [11] use an approximation of the system
size. The same holds for the algorithm in [18], which uses
similar ideas to determine the distribution of a utility in or- 
der to isolate peers with high capability — i.e., super-peers. 

As far as we know, the distributed slicing problem was 
studied in a P2P system for the first time in [13]. In that paper,
every node draws independently and uniformly a random
value in the interval (0, 1]. Each of these values serves
as an estimate of the normalized index $k/n$ for the node with
the $k$-th smallest attribute value.

3. Model and Problem Statement 

3.1. System model 

We consider a system S containing a set of n uniquely 
identified nodes. (The value n may vary over time.) The set 
of identifiers is denoted by $I \subset \mathbb{N}$. Each node can leave and
new nodes can join the system at any time, thus the number 
of nodes is a function of time. Nodes may also crash. In 
this paper, we do not differentiate between a crash and a 
voluntary node departure. 

Each node i maintains a fixed attribute value $a_i \in \mathbb{N}$, reflecting
the node capability according to a specific metric.
These attribute values over the network might have an arbitrarily
skewed distribution. Initially, a node has no global
information about the structure or size of the system,
nor about the attribute values of the other nodes.

We can define a total ordering over the nodes based 
on their attribute value, with the node identifier used to 
break ties. Formally, we let i precede j if and only if
$a_i < a_j$, or $a_i = a_j$ and $i < j$. We refer to this totally
ordered sequence as the attribute-based sequence, denoted
by A.sequence. The attribute-based rank of a node i, denoted
by $\alpha_i \in \{1, \dots, n\}$, is defined as the index of $a_i$ in
A.sequence.

3.2. Distributed Slicing Problem 

Figure 1. Slicing of a population based on a height attribute.

Let $S_{l,u}$ denote the slice containing every node i whose
normalized rank, namely $\alpha_i / n$, satisfies $l < \alpha_i / n \leq u$, where
$l \in [0, 1)$ is the slice lower boundary and $u \in (0, 1]$ is the
slice upper boundary, so that all slices represent adjacent
intervals $(l_1, u_1], (l_2, u_2], \dots$ Let us assume that we partition
the interval (0, 1] using a set of slices, and this partitioning 
is known by all nodes. The distributed slicing problem re- 
quires each node to determine the slice it currently belongs 
to. Note that the problem stated this way is similar to the or- 
dering problem, where each node has to determine its own 
index in A.sequence. However, the reference to slices introduces
special requirements related to stability and fault
tolerance. Besides, it allows for future generalizations when
one considers different types of categorizations.
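
The definitions above can be illustrated with a small centralized sketch. It is not part of the distributed algorithms of this paper: it assumes global knowledge of all attribute values, and the names `attribute_ranks` and `slice_of`, as well as the sample data, are hypothetical.

```python
def attribute_ranks(attrs):
    """Map node id -> 1-based rank in the attribute-based sequence.

    attrs: dict node_id -> attribute value.
    Ties on the attribute value are broken by node identifier.
    """
    ordered = sorted(attrs, key=lambda i: (attrs[i], i))
    return {node: idx + 1 for idx, node in enumerate(ordered)}


def slice_of(normalized_rank, boundaries):
    """Return the slice (l, u) such that l < normalized_rank <= u.

    boundaries: increasing list from 0 to 1, e.g. [0, 0.5, 1]
    for two slices of equal size.
    """
    for l, u in zip(boundaries, boundaries[1:]):
        if l < normalized_rank <= u:
            return (l, u)
    raise ValueError("normalized rank must lie in (0, 1]")


# Hypothetical population: node id -> height in cm.
heights = {1: 180, 2: 165, 3: 180, 4: 190}
ranks = attribute_ranks(heights)
n = len(heights)
slices = {i: slice_of(ranks[i] / n, [0, 0.5, 1]) for i in heights}
```

Nodes 1 and 3 share the same attribute value, so the identifier decides their relative order; the two shortest nodes end up in the lower slice and the two tallest in the upper one.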

Figure 1 illustrates an example of a population of 10 persons,
to be sorted against their height. A partition of this
population could be defined by two slices of the same size: 
the group of short persons, and the group of tall persons. 
This is clearly an example where the distribution of attribute 
values is skewed towards 2 meters. The rank of each person 
in the population and the two slices are represented on the 
bottom axis. Each person is represented as a small cross on 
these axes.¹ Each slice is represented as an oval. The slice
$S_1 = S_{0,1/2}$ contains the five shortest persons and the slice
$S_2 = S_{1/2,1}$ contains the five tallest persons.

Since the distribution of attribute values is unknown and 
hard to predict, defining relevant groups is a difficult task. 
For example, if the distribution of the human heights were 
unknown, then the persons shorter than 1 m could be considered
as short and the persons taller than 1 m could be considered
as tall. In this case, the first of the two groups
would be empty, while the second of the two groups would 
be as big as the whole system. Conversely, slices partition 
the population into subsets representing a predefined por- 
tion of this population. Therefore, in the rest of the paper, 
we consider slices as defined as a proportion of the network. 

3.3. Facing Churn 

Node churn, that is, the continuous arrival and departure
of nodes, is an intrinsic characteristic of P2P systems
and may significantly impact the outcome, and more specifically
the accuracy, of the slicing algorithm. The easier case

¹ Note that the smallest (resp. largest) rank is represented by a cross at
the extreme left (resp. right) of the bottom axis. 



is when the distribution of the attribute values of the depart- 
ing and arriving nodes are identical. In this case, in princi- 
ple, the arriving nodes must find their slices, but the nodes 
that stay in the system are mostly able to keep their slice 
assignment. Even in this case however, nodes that are close 
to the border of a slice may expect frequent changes in their 
slice due to the variance of the attribute values, which is 
non-zero for any non-constant distribution. If the arriving 
and departing nodes have different attribute distributions, 
so that the distribution in the actual network of live nodes 
keeps changing, then this effect is amplified. However, we
believe it is realistic to assume that the
churn may be correlated with some specific attribute values (for example,
if the considered attribute is uptime or connectivity).

4. Dynamic Ordering by Exchange of Random 
Values 

This section proposes an algorithm for the distributed 
slicing problem improving upon the original JK algorithm 
[13] by considering a local measure of the global disorder
function. In this section we present the algorithm along with 
the corresponding analysis and simulation results. 

4.1. On Using Random Numbers to Sort 
Nodes 

This section presents the algorithm built upon JK. We refer
to this algorithm as mod-JK (standing for modified JK).
In JK, each node i generates a real number $r_i \in (0, 1]$
independently and uniformly at random. The key idea is to
sort these random numbers with respect to the attribute values
by swapping (i.e., exchanging) these random numbers
between nodes, so that if $a_i < a_j$ then $r_i < r_j$. Eventually,
the attribute values (that are fixed) and the random
values (that are exchanged) should be sorted in the same order.
That is, each node would like to obtain the $x$-th largest
random number if it owns the $x$-th largest attribute value.
Let R.sequence denote the random sequence obtained by
ordering all nodes according to their random number. Let
$\rho_i(t)$ denote the index of node i in R.sequence at time t.
When not required, the time parameter is omitted.

Once sorted, the random values are used to determine the 
portion of the network a peer belongs to. 
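
The swapping idea can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: every node can compare itself with every other node, whereas JK and mod-JK only use the partial, gossip-maintained view. The function names are hypothetical.

```python
def misplaced(a_i, r_i, a_j, r_j):
    """True when attribute order and random-value order disagree."""
    return (a_j - a_i) * (r_j - r_i) < 0


def sort_by_swaps(attrs, rands):
    """Swap random values between misplaced pairs until none remains.

    attrs[i] and rands[i] belong to node i. Attribute values stay fixed;
    only random values move. Each swap of a misplaced pair removes at
    least one inversion, so the loop terminates.
    """
    rands = list(rands)
    changed = True
    while changed:
        changed = False
        for i in range(len(attrs)):
            for j in range(i + 1, len(attrs)):
                if misplaced(attrs[i], rands[i], attrs[j], rands[j]):
                    rands[i], rands[j] = rands[j], rands[i]
                    changed = True
    return rands


result = sort_by_swaps([30, 10, 20], [0.2, 0.9, 0.5])
```

After the run, the node with the smallest attribute value holds the smallest random value, and so on, which is exactly the state in which a node can read its slice off its random value.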

4.2. Definitions 

Every node i keeps track of some neighbors and their
age. The age of neighbor j is a timestamp, $t_j$, set to 0
when j becomes a neighbor of i. Thus, node i maintains
an array containing the id, the age, the attribute value, and
the random value of its neighbors. This array, denoted $\mathcal{N}_i$,
is called the view of node i. The views of all nodes have



Initial state of node i

(1) period_i, initially set to a constant;
    r_i, a random value chosen in (0, 1]; a_i, the attribute value;
    slice_i ← ⊥, the slice i belongs to; N_i, the view;
    gain_j', a real value indicating the gain achieved by
    exchanging with j';
    gain-max = 0, a real.

Active thread at node i

(2) wait(period_i)
(3) recompute-view()_i
(4) for j' ∈ N_i
(5)   if gain_j' > gain-max then
(6)     gain-max ← gain_j'
(7)     j ← j'
(8) end for
(9) send(REQ, r_i, a_i) to j
(10) recv(ACK, r'_j) from j
(11) r_j ← r'_j
(12) if (a_j − a_i)(r_j − r_i) < 0 then
(13)   r_i ← r_j
(14) slice_i ← S_{l,u} such that l < r_i ≤ u

Passive thread at node i activated upon reception

(15) recv(REQ, r_j, a_j) from j
(16) send(ACK, r_i) to j
(17) if (a_j − a_i)(r_j − r_i) < 0 then
(18)   r_i ← r_j
(19) slice_i ← S_{l,u} such that l < r_i ≤ u



the same size, denoted by c. A node i participates in the
algorithm by exchanging its random value with a misplaced neighbor
in its view. Neighbor j is misplaced if and only if
$(a_j - a_i)(r_j - r_i) < 0$. In [13], a measure of the relative
disorder of sequence R.sequence with respect to sequence
A.sequence was introduced. We call it the global
disorder measure (GDM) and it is defined, for any time t,
as $GDM(t) = \frac{1}{n} \sum_i (\alpha_i - \rho_i(t))^2$. The minimal value of
GDM is 0, which is obtained when $\rho_i(t) = \alpha_i$ for all nodes
i. In this case the attribute-based index of a node is equal
to its random value index, indicating that random values are
ordered.
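
As a rough illustration of the global disorder measure, the following sketch computes it for a snapshot of the whole system. It assumes the 1/n normalisation of the squared index differences and global knowledge that no single node has in practice; `index_in_order` and `gdm` are hypothetical helper names.

```python
def index_in_order(values):
    """1-based index of each node when nodes are sorted by (value, node id)."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    index = [0] * len(values)
    for position, node in enumerate(order):
        index[node] = position + 1
    return index


def gdm(attrs, rands):
    """Mean squared difference between attribute index and random-value index."""
    alpha = index_in_order(attrs)   # indices in A.sequence
    rho = index_in_order(rands)     # indices in R.sequence
    return sum((a - p) ** 2 for a, p in zip(alpha, rho)) / len(attrs)


ordered = gdm([10, 20, 30], [0.1, 0.5, 0.9])    # random values already sorted
reversed_ = gdm([10, 20, 30], [0.9, 0.5, 0.1])  # random values fully reversed
```

The first call yields 0 because both sequences agree; the second yields (4 + 0 + 4) / 3, the largest disorder possible for three nodes.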

4.3. Improved Ordering Algorithm 

In this algorithm, each node i searches its own view $\mathcal{N}_i$
for misplaced neighbors. Then, one of them is chosen to
swap random values with. This process is repeated until
there is no global disorder. In this version of the algorithm,
we provide each node with the capability of measuring disorder
locally. This leads to a new heuristic for each node to
determine the neighbor whose exchange decreases
the disorder most. Referring to this disorder measure as a
criterion, the decrease of the global criterion is related to the
decrease of local criteria, similarly to lH]. 

For a node i to evaluate the gain of exchanging with a
node j of its current view $\mathcal{N}_i$, we define its local disorder
measure (abbreviated $LDM_i$). Let LA.sequence_i and
LR.sequence_i be the local attribute sequence and the local
random sequence of node i, respectively. These sequences
are computed locally by i using the information $\mathcal{N}_i \cup \{i\}$.
Similarly to A.sequence and R.sequence, these are the sequences
of neighbors where each node is ordered according
to its attribute value and random number, respectively. Let,
for any $j \in \mathcal{N}_i \cup \{i\}$, $\ell\rho_j(t)$ and $\ell\alpha_j(t)$ be the indices of
$r_j$ and $a_j$ in sequences LR.sequence_i and LA.sequence_i,
respectively, at time t. At any time t, the local disorder
measure of node i is defined as:

$$LDM_i(t) = \frac{1}{c+1} \sum_{j \in \mathcal{N}_i(t) \cup \{i\}} \left(\ell\alpha_j(t) - \ell\rho_j(t)\right)^2. \qquad (1)$$

We denote by $G_{i,j}(t+1) = LDM_i(t) - LDM_i(t+1)$ the
reduction on this measure that i obtains after exchanging its
random value with node j between time t and t + 1.

The heuristic used chooses for node i the misplaced
neighbor j that maximizes $G_{i,j}(t+1)$.
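
The heuristic can be sketched as follows. The code recomputes the local disorder measure before and after each candidate swap rather than using a closed-form gain, and it assumes that the measure is averaged over the c + 1 entries of the view plus the node itself. The names `local_indices`, `ldm`, and `best_neighbor`, and the sample view, are hypothetical.

```python
def local_indices(values):
    """node -> 1-based index when the local nodes are sorted by (value, id)."""
    order = sorted(values, key=lambda k: (values[k], k))
    return {node: pos + 1 for pos, node in enumerate(order)}


def ldm(attrs, rands):
    """Local disorder over the nodes known to i (its view plus i itself)."""
    la = local_indices(attrs)   # indices in the local attribute sequence
    lr = local_indices(rands)   # indices in the local random sequence
    return sum((la[k] - lr[k]) ** 2 for k in attrs) / len(attrs)


def best_neighbor(i, attrs, rands):
    """Misplaced neighbor whose swap with i reduces the local disorder most."""
    before = ldm(attrs, rands)
    best, best_gain = None, 0.0
    for j in attrs:
        if j == i or (attrs[j] - attrs[i]) * (rands[j] - rands[i]) >= 0:
            continue  # skip i itself and neighbors that are not misplaced
        swapped = dict(rands)
        swapped[i], swapped[j] = rands[j], rands[i]
        gain = before - ldm(attrs, swapped)
        if gain > best_gain:
            best, best_gain = j, gain
    return best, best_gain


# Node 0 and its view {1, 2, 3}: attribute and random value per node.
attrs = {0: 10, 1: 20, 2: 30, 3: 40}
rands = {0: 0.9, 1: 0.2, 2: 0.5, 3: 0.1}
choice = best_neighbor(0, attrs, rands)
```

In this example all three neighbors are misplaced with respect to node 0, but swapping with node 3 orders the local sequence completely, so node 3 is selected with a gain of 4.5, against a gain of 1.0 for either of the other two.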

Sampling uniformly at random. The algorithm relies on 
the fact that potential misplaced nodes are found so that they 
can swap their random numbers, thereby increasing order. If
the global disorder is high, it is very likely that any given 
node has misplaced neighbors in its view to exchange with. 



Figure 2. Dynamic ordering algorithm. 

Nevertheless, as the system gets ordered, it becomes more 
unlikely for a node i to have misplaced neighbors. In this 
stage the way the view is composed plays a crucial role: if 
fresh samples from the network are not available, conver- 
gence can be slower than optimal. 

Several protocols may be used to provide a random and 
dynamic sampling in a P2P system such as Newscast ifTSll . 
Cyclon [21] or Lpbcast [12]. They differ mainly by their
closeness to the uniform random sampling of the neighbors 
and the way they handle churn. In this paper, we chose to 
use a variant of the Cyclon protocol, to construct and update 
the views, as it is reportedly the best approach to achieve a 
uniform random neighbor set for all nodes [10].

Description of the algorithm. The algorithm is presented 
in Figure 2. The active thread at node i runs the membership
(gossiping) procedure (recompute-view()_i) and the exchange
of random values periodically. As motivated above,
the membership procedure is similar to the Cyclon algorithm.
This variant of Cyclon exchanges all entries of the
view at each step and uses two additional parameters: the
attribute value and the random value. For the detailed pseudocode,
please refer to the full version of this paper [6].

The algorithm for exchanging random values from node
i starts by measuring the ordering that can be gained by
swapping with each neighbor (Lines 4-8). Then, i chooses
the neighbor $j \in \mathcal{N}_i$ that maximizes the gain $G_{i,k}$ over
all its neighbors k. Formally, i finds $j \in \mathcal{N}_i$ such that for any
$k \in \mathcal{N}_i$, we have

$$G_{i,j}(t+1) \geq G_{i,k}(t+1).$$

In Figure 2, at node i, we refer to $gain_j$ as the value of
$\ell\alpha_i(t)\,\ell\rho_j(t) + \ell\alpha_j(t)\,\ell\rho_i(t) - \ell\alpha_j(t)\,\ell\rho_j(t)$. This expression
is directly obtained from equation (1); see the full version
of this paper [6] for further details.

From this point on, i exchanges its random value $r_i$
with the random value $r_j$ of node j (Line 11). The passive
threads are executed upon reception of a message. In
Figure 2, when j receives the random value $r_i$ of node i,
it sends back its own random value $r_j$ for the exchange to
occur (Lines 15-16). Observe that the attribute value of i
is also sent to j, so that j can check if it is correct to exchange
before updating its own random number (Lines 17-18).
Node i does not need to receive attribute value $a_j$ of
j, since i already has this information in its view and the
attribute value of a node never changes over time.

4.4. Analysis of Slice Inaccuracy 

In mod-JK, as in JK, the current random number of 
a node i determines the slice Si of the node. The objec- 
tive of both algorithms is to reduce the global disorder as 
quickly as possible. Algorithm mod-JK consists in choosing
one neighbor among the possible neighbors that would
have been chosen in JK, and the GDM of JK has been
shown to fit an exponential decrease. Consequently, mod-JK
also experiences an exponential decrease of the global
disorder. Eventually, JK and mod-JK ensure that the disorder
has fully disappeared. For further information, please
refer to [13].

However, the accuracy of the slices heavily depends on
the uniformity of the random value spread between 0 and
1. It may happen that the distribution of the random values
is such that some peers decide upon a wrong slice. Even
more problematic is the fact that this situation is unrecoverable
unless a new random value is drawn for all nodes.
This may be considered as an inherent limitation of the approach.
For example, consider a system of size 2, where
nodes 1 and 2 have the random values $r_1 = 0.1$, $r_2 = 0.4$.
If we are interested in creating two slices $S_1$ and $S_2$ of equal
size ($S_1 = S_{0,1/2}$ and $S_2 = S_{1/2,1}$), both nodes will wrongly
believe they belong to the same slice $S_1$, since $r_1$ and $r_2$ belong
to $(0, 1/2]$. This wrong estimate holds even after perfect
ordering of the random values.

Therefore, an important step is to characterize the inaccuracy
of slice assignment and how likely it may happen.
To this end, we bound the deviation of the random value distribution
from the mean, and we lower bound the probability
that this happens with only two slices. The following result
bounds, with high probability, the number of nodes that can
be misplaced in the system. For the proof of Lemma 4.1,
please refer to the full version of this paper [6].

Lemma 4.1 For any $\beta \in (0, 1]$, a slice $S_p$ of length $p \in (0, 1]$
has a number of peers $X \in [(1 - \beta)np, (1 + \beta)np]$
with probability at least $1 - \varepsilon$ as long as $p \geq \frac{3}{\beta^2 n} \ln(2/\varepsilon)$.

To measure the effect discussed above during the simulation
experiments, we introduce the slice disorder measure
(SDM) as the sum over all nodes i of the distance between
the slice i actually belongs to and the slice i believes it belongs
to. For example (in the case where all slices have the
same size), if node i belongs to the 1st slice (according to
its attribute value) while it thinks it belongs to the 3rd slice
(according to its rank estimate) then the distance for node i
is |1 − 3| = 2. Formally, for any node i, let $S_{l_i,u_i}$ be the
actual correct slice of node i and let $S_{\hat{l}_i,\hat{u}_i}(t)$ be the slice i
estimates as its slice at time t. The slice disorder measure is
defined as:

$$SDM(t) = \sum_i \frac{1}{u_i - l_i} \left| \frac{u_i + l_i}{2} - \frac{\hat{u}_i + \hat{l}_i}{2} \right|.$$

SDM(t) is minimal (equals 0) if for all nodes i, we have
$S_{\hat{l}_i,\hat{u}_i}(t) = S_{l_i,u_i}$.
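
For equally sized slices, the distance between two slices reduces to the difference of their indices, as in the example above. The sketch below computes the slice disorder measure in that special case only, from a global snapshot; it assumes k equal slices numbered 1 to k, and the helper names are hypothetical.

```python
def slice_index(x, k):
    """Index s in 1..k of the equal-size slice ((s-1)/k, s/k] containing x.

    x is assumed to lie in (0, 1].
    """
    for s in range(1, k + 1):
        if x <= s / k:
            return s
    raise ValueError("x must lie in (0, 1]")


def sdm(attrs, rands, k):
    """Sum over all nodes of |actual slice index - believed slice index|."""
    n = len(attrs)
    order = sorted(range(n), key=lambda i: (attrs[i], i))
    total = 0
    for rank, node in enumerate(order, start=1):
        actual = slice_index(rank / n, k)      # from the attribute-based rank
        believed = slice_index(rands[node], k)  # from the node's random value
        total += abs(actual - believed)
    return total


# The two-node example: random values 0.1 and 0.4, two equal slices.
two_nodes = sdm([5, 7], [0.1, 0.4], k=2)
```

Both random values fall in (0, 1/2], so the node with the larger attribute value believes it is in the first slice although it belongs to the second, and the measure equals 1 even though the random values are perfectly ordered.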

4.5. Simulation Results 

We present simulation results using PeerSim [14], using
a simplified cycle-based simulation model, where all
message exchanges are atomic, so messages never overlap.
First, we compare the performance of the two algorithms:
JK and mod-JK. Second, we study the impact of
concurrency that is ignored by the cycle-based simulations. 

Performance comparison. We compare the time taken
by these algorithms to sort the random values according to
the attribute values (i.e., the node with the $j$-th largest attribute
value of the system obtains the $j$-th largest random
value). In order to evaluate the convergence speed of each
algorithm, we use the slice disorder measure as defined in
Section 4.4.

We simulated $10^4$ participants in 100 equally sized slices
(when unspecified), each with a view size c = 20. Figure 3
presents the evolution of the slice disorder measure
over time for JK and mod-JK.

Figure 3 shows the slice disorder measure to compare
the convergence speed of our algorithm to that of JK with 10
equally sized slices. Our algorithm converges significantly
faster than JK. Note that neither algorithm reaches zero
SDM, since they are both based on the same idea of sorting
randomly generated values. Besides, since they both used
an identical set of randomly generated values, both converge
to the same SDM. Note that for the sake of fairness,
JK and mod-JK are compared using the same underlying
view management protocol in our simulation: the variant of
Cyclon.

Figure 3. Slice disorder measure over time.

Concurrency. The simulations are cycle-based and at 
each cycle an algorithm step is done atomically so that 
no other execution is concurrent. More precisely, the al- 
gorithms are simulated such that in each cycle, each node 
updates its view before sending its random value or its at- 
tribute value. Given this implementation, the cycle-based 
simulator does not allow us to realistically simulate con- 
currency, and a drawback is that view is up-to-date when 
a message is sent. In the following we artificially intro- 
duce concurrency (so that view might be out-of-date) into 
the simulator and show that it has only a slight impact on 
the convergence speed. 

Adding concurrency raises some realistic problems due
to the use of non-atomic push-pull [12] in each message exchange.
That is, concurrency might lead to other problems
because of the potential staleness of views: unsuccessful
swaps due to useless messages. Technically, the view of
node i might indicate that j has a random value r while this
value is no longer up-to-date. This happens if i last
updated its view before j swapped its random value with
another node j'. Moreover, due to asynchrony, it could happen
that by the time a message is received this message has become
useless. Assume that node i sends its random value
$r_i$ to j in order to obtain $r_j$ at time t and j receives it by
time $t + \delta$. With no loss of generality assume $r_i > r_j$.
Then if j swaps its random value with j' such that $r_{j'} > r_i$
between time t and $t + \delta$, then the message of i becomes
useless and the expected swap does not occur (we call this
an unsuccessful swap).

Figure 4(b) indicates the impact of concurrent message
exchange on the convergence speed, while Figure 4(a)
shows the amount of useless messages that are sent. Now,
we explain how the concurrency is simulated. Let the overlapping
messages be a set of messages that mutually overlap:
there exists, for any pair of overlapping messages, at
least one instant at which they are both in transit. For each
algorithm we simulated (i) full concurrency: in a given cycle,
all messages are overlapping messages; and (ii) half
concurrency: in a given cycle, each message is an overlapping
message with probability 1/2. Generally, we see that
increasing the concurrency increases the number of useless
messages. Moreover, in the modified version of JK, more
messages are ignored than in the original JK algorithm.
This is due to the fact that some nodes (the most misplaced
ones) are more likely to be targeted, which increases the number
of concurrent messages arriving at the same nodes. Since a
node i is more likely to ignore a message when it receives more
messages during the same cycle, concentrating
message sending at some targets increases the number
of useless messages.

Figure 4(b) compares the convergence speed under full
concurrency and no concurrency. Full concurrency impacts
the convergence speed only very slightly.

5. Dynamic Ranking by Sampling of Attributes 

In this section we propose an alternative algorithm for 
the distributed slicing problem. This algorithm circumvents 
the problems identified in the previous approach by contin- 
uously ranking nodes based on observing attribute value in- 
formation. Besides, this algorithm is not sensitive to churn 
even if it is correlated with attribute values. In the remain- 
ing part of the paper we refer to this new algorithm as the 
ranking algorithm while referring to JK and mod-JK as the 
ordering algorithms. 

Impact of dynamics correlated with attribute. As al- 
ready mentioned, the ordering algorithms rely on the fact 
that random values are uniformly distributed. However, if 
the attribute values are not constant but correlated with the 
dynamic behavior of the system, the distribution of random 
values may change from uniform to skewed quickly. For in- 
stance, assume that each node maintains an attribute value 
that represents its own lifetime. Although the algorithm is 
able to quickly sort random values, so nodes with small life- 
time will obtain the small random values, it is more likely 
that these nodes leave the system sooner than other nodes. 
This results in a higher concentration of high random val- 
ues and a large population of the nodes wrongly estimate 
themselves as being part of the higher slices. 

Inaccurate slice assignments. As discussed in detail in previous
sections, slice assignments will typically be imperfect
even when the random values are perfectly ordered.
Since the ranking approach does not rely on ordering random
values, this problem is not raised: the algorithm guarantees
eventually perfect assignment in a static environment.

Figure 4. (a) Percentage of unsuccessful swaps, (b) Convergence speed under high concurrency.

Concurrency side-effect. In the previous ordering algorithms,
a non-negligible amount of messages are sent unnecessarily.
The concurrency of messages has a drastic effect
on the number of useless messages as shown previously,
slowing down convergence. In the ranking algorithm, concurrency
has no impact on convergence speed because all
received messages are taken into account. This is because the
information encapsulated in a message (the attribute value
of a node) is guaranteed to be up to date, as long as the
attribute values are constant.

5.1. Ranking Algorithm Specification 

The pseudocode of the ranking algorithm is presented in
Figure 5. As opposed to the ordering algorithm of the previous
section, the ranking algorithm does not assign random
initial unalterable values as candidate ranks. Instead, the
ranking algorithm improves its rank estimate each time a
new message is received.

The ranking algorithm works as follows. Periodically
each node i updates its view $\mathcal{N}_i$ following an underlying
protocol that provides a uniform random sample (Line 3);
later, we simulate the algorithm using a variant of the Cyclon
protocol. See [6] for further details. Node i computes its
rank estimate (and hence its slice) by comparing the attribute
value of its neighbors to its own attribute value. This
estimate is set to the ratio of the number of nodes with a
lower attribute value that i has seen over the total number of
nodes i has seen (Line 15). Node i looks at the normalized
rank estimate of all its neighbors. Then, i selects the node $j_1$
closest to a slice boundary (according to the rank estimates
of its neighbors). Node i also selects a random neighbor $j_2$
among its view (Line 12). When those two nodes are selected,
i sends an update message, denoted by a flag UPD,
to $j_1$ and $j_2$ containing its attribute value (Lines 13-14).

The reason why a node close to the sUce boundary is 
selected as one of the contacts is that such nodes need more 
samples to accurately determine which slice they belong to 
(subsection l5.2l shows this point). This technique introduces 
a bias towards them, so they receive more messages. 

Upon reception of a message from node i, the passive
threads of j1 and j2 are activated so that j1 and j2 compute
their new rank estimates r_j1 and r_j2. The estimate of the
slice a node belongs to follows from the computation of the rank
estimate. Messages are not replied to: communication is one-way,
resulting in the same message complexity as JK and
mod-JK.
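
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, slices are assumed to have equal width, uniform sampling stands in for the Cyclon-based view protocol, and j1 is chosen by the neighbors' rank estimates as described in the text.

```python
import random


class RankingNode:
    """Sketch of one node running the ranking algorithm (names are hypothetical)."""

    def __init__(self, attribute, num_slices):
        self.attribute = attribute    # a_i, assumed constant
        self.num_slices = num_slices  # equal-width slices (assumption)
        self.seen = 0                 # g_i: attribute values encountered
        self.lower = 0                # l_i: encountered values lower than a_i
        self.rank = None              # r_i: normalized rank estimate
        self.slice = None             # index of the estimated slice

    def _observe(self, attribute):
        self.seen += 1
        if attribute < self.attribute:
            self.lower += 1

    def _update_estimate(self):
        self.rank = self.lower / self.seen
        # Map the estimate to a slice index, clamping the case rank == 1.0.
        self.slice = min(int(self.rank * self.num_slices), self.num_slices - 1)

    def boundary_distance(self):
        """Distance from the rank estimate to the nearest interior boundary."""
        if self.rank is None:
            return 0.0  # no estimate yet: treat the node as needing samples
        return min(abs(self.rank - b / self.num_slices)
                   for b in range(1, self.num_slices))

    def active_step(self, view):
        """Periodic step: sample the view, then notify two neighbors."""
        for neighbor in view:
            self._observe(neighbor.attribute)
        j1 = min(view, key=RankingNode.boundary_distance)
        j2 = random.choice(view)
        for target in (j1, j2):
            target.on_update(self.attribute)  # one-way UPD message
        self._update_estimate()

    def on_update(self, attribute):
        """Passive step: fold the received attribute into the estimate."""
        self._observe(attribute)
        self._update_estimate()


def simulate(n=200, num_slices=4, view_size=10, cycles=30, seed=1):
    """Run synchronous cycles with uniformly sampled views (assumption)."""
    random.seed(seed)
    nodes = [RankingNode(random.random(), num_slices) for _ in range(n)]
    for _ in range(cycles):
        for node in nodes:
            sample = random.sample(nodes, view_size + 1)
            view = [m for m in sample if m is not node][:view_size]
            node.active_step(view)
    return nodes


def accuracy(nodes):
    """Fraction of nodes whose estimated slice matches the true one."""
    num_slices = nodes[0].num_slices
    ordered = sorted(nodes, key=lambda m: m.attribute)
    hits = sum(m.slice == k * num_slices // len(nodes)
               for k, m in enumerate(ordered))
    return hits / len(nodes)
```

Calling `accuracy(simulate())` gives the fraction of correctly sliced nodes in this toy setting. The nodes that remain misassigned are expected to be those whose true rank lies close to a slice boundary, which is why the algorithm biases messages towards them.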

5.2. Theoretical Analysis 

The following Theorem shows a lower bound on the 
probability for a node i to accurately estimate the slice it 
belongs to. This probability depends not only on the num- 
ber of attribute exchanges but also on the rank estimate of i. 
For the proof of Theorem 5.1, please refer to the full version
of this paper [6].

Theorem 5.1 Let $p$ be the normalized rank of i and let $\hat{p}$ be
its estimate. For node i to exactly estimate its slice with a
confidence coefficient of $100(1-\alpha)\%$, the number of messages
i must receive is:

$$\left( Z_{\alpha/2} \, \frac{\sqrt{\hat{p}(1-\hat{p})}}{d} \right)^{2}$$

where $d$ is the distance between the rank estimate of i and
the closest slice boundary, and $Z_{\alpha/2}$ represents the endpoints
of the confidence interval.

To conclude, under reasonable assumptions every node estimates
its slice with confidence coefficient $100(1-\alpha)\%$ after
a finite number of message receipts. Moreover, a node closer



Initial state of node i

(1) period_i, initially set to a constant; r_i, a value in (0,1];
    a_i, the attribute value; b, the closest slice boundary to node i;
    g_i, the counter of encountered attribute values; l_i, the counter
    of lower attribute values; slice_i ← ⊥; N_i, the view.

Active thread at node i

(2)  wait(period_i)
(3)  recompute-view()_i
(4)  dist-min ← ∞
(5)  for j' ∈ N_i
(6)      g_i ← g_i + 1
(7)      if a_j' < a_i then l_i ← l_i + 1
(8)      if dist(a_j', b) < dist-min then
(9)          dist-min ← dist(a_j', b)
(10)         j1 ← j'
(11) end for
(12) Let j2 be a random node of N_i
(13) send(UPD, a_i) to j1
(14) send(UPD, a_i) to j2
(15) r_i ← l_i / g_i
(16) slice_i ← S_l,u such that l < r_i ≤ u

Passive thread at node i activated upon reception

(17) recv(UPD, a_j) from j
(18) if a_j < a_i then l_i ← l_i + 1
(19) g_i ← g_i + 1
(20) r_i ← l_i / g_i
(21) slice_i ← S_l,u such that l < r_i ≤ u

Figure 5. Dynamic ranking algorithm.

to the slice boundary needs more messages than a node far 
from the boundary. 
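
To make the dependence on the boundary distance concrete, the following sketch computes a required number of messages. It assumes the bound takes the usual normal-approximation form for estimating a proportion, n = (Z_{α/2} · sqrt(p̂(1 − p̂)) / d)²; the function name and its parameters are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def messages_needed(rank_estimate, distance, alpha=0.05):
    """Messages needed so that a 100(1 - alpha)% confidence interval around
    the rank estimate does not cross the closest slice boundary
    (normal approximation, assumed form of the bound)."""
    if distance <= 0:
        raise ValueError("distance to the slice boundary must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    spread = sqrt(rank_estimate * (1 - rank_estimate))
    return ceil((z * spread / distance) ** 2)
```

With a rank estimate of 0.5 and 95% confidence, `messages_needed(0.5, 0.05)` returns 385 while `messages_needed(0.5, 0.01)` returns 9604: dividing the distance to the boundary by five multiplies the required number of messages by roughly twenty-five.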

5.3. Simulation Results 

This section evaluates the ranking algorithm by focus- 
ing on three different aspects. First, the performance of the 
ranking algorithm is compared to the performance of the 
ordering algorithm in a large-scale system where the distri- 
bution of attribute values does not vary over time. Second, 
we investigate if sufficient uniformity is achievable in real- 
ity using a dedicated protocol. Third, the ranking algorithm 
(with and without sliding window technique) and ordering 
algorithm are compared in a dynamic system where the dis- 
tribution of attribute values may change. For this purpose, 
we ran two simulations, one for each algorithm. The system
contains (initially) 10^4 nodes and each view contains
10 uniformly drawn random nodes and is updated in each 
cycle. The number of slices is 100, and we present the evo- 
lution of the slice disorder measure over time. 

Performance comparison in the static case. Figure 6(a)
compares the ranking algorithm to the ordering algorithm
while the distribution of attribute values does not change over
time (varying distribution is simulated below). The difference
between the ordering algorithm and the ranking algorithm
indicates that the ranking algorithm gives a more
precise result (in terms of node to slice assignments) than 
the ordering algorithm. More importantly, the slice disor- 
der measure obtained by the ordering algorithm is lower 
bounded while the one of the ranking algorithm is not. Con- 
sequently, this simulation shows that the ordering algorithm 
might fail in slicing the system while the ranking algorithm 
keeps improving its accuracy over time. 

Feasibility of the ranking algorithm. Figure 6(b) shows
that the ranking algorithm does not need artificial uniform 
drawing of neighbors. Indeed, an underlying view manage- 
ment protocol might lead to similar performance results. 
In the presented simulation we used an artificial protocol, 
drawing neighbors uniformly at random in each cycle of
the algorithm execution, and the variant of the Cyclon view 
management protocol presented above. Those underlying 
protocols are distinguished on the figure using terms "uni- 
form" (for the former one) and "views" (for the latter one). 
This figure shows that both cases give very similar results. 
The SDM legend is on the right-hand vertical axis while
the left-hand vertical axis indicates what percentage the
SDM difference represents over the total SDM value. At
any time during the simulation (and for both types of algorithms)
its value remains within plus or minus 7%. The two
SDM curves of the ranking algorithm almost overlap. To
conclude, an underlying distributed protocol that
shuffles the views among nodes can easily be used with the
ranking algorithm to provide results similar to the optimal.

Performance comparison in the dynamic case. In Figure 6(c),
the two curves represent the slice disorder measure
obtained using the ordering algorithm and the ranking algorithm.
We simulate the churn such that 0.1% of nodes
leave and join in each of the first 200 cycles. We observe
how the SDM converges. The churn is reasonably and pessimistically
tuned compared to recent experimental evaluations [20]
of the session duration in three well-known P2P
systems.

The distribution of the churn is correlated with the at- 
tribute values: the leaving nodes are the nodes with the 
lowest attribute values while the entering nodes have higher 
attribute values than all nodes already in the system. The 
churn introduces a significant disorder in the system which 
counters the fast decrease. When the churn stops, the ranking
algorithm readapts the slice assignments well: the SDM
starts decreasing again. However, in the ordering algorithm,
the convergence of SDM gets stuck. This leads to a poor 
slice assignment accuracy. 
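
The effect of attribute-correlated churn on true ranks can be illustrated with a toy sketch. The helper names and the integer attribute values below are purely illustrative assumptions; only the churn pattern (lowest values leave, higher values join) follows the description above.

```python
def churn_step(attributes, fraction):
    """One cycle of correlated churn: the nodes with the lowest attribute
    values leave, and as many nodes join with values above the current
    maximum."""
    leaving = max(1, round(len(attributes) * fraction))
    ordered = sorted(attributes)
    survivors = ordered[leaving:]
    highest = ordered[-1]
    joiners = [highest + k for k in range(1, leaving + 1)]
    return survivors + joiners


def normalized_rank(attributes, value):
    """Fraction of nodes whose attribute value is lower than `value`."""
    return sum(a < value for a in attributes) / len(attributes)


def rank_after_churn(value, size=1000, fraction=0.001, cycles=200):
    """True normalized rank of `value` after `cycles` churn steps."""
    attributes = list(range(size))
    for _ in range(cycles):
        attributes = churn_step(attributes, fraction)
    return normalized_rank(attributes, value)
```

In this toy system of 1000 nodes, the node with attribute 500 starts at normalized rank 0.5, and `rank_after_churn(500)` returns 0.3 after 200 cycles: its true rank has drifted across slice boundaries although its attribute never changed. An estimate that is not refreshed from the attribute values currently present in the system cannot follow this drift.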

In Figure 6(d), the three curves represent the slice disorder
measure obtained using the ordering algorithm, the
ranking algorithm, and a modified version of the ranking
algorithm with sliding windows. (The simulation obtained
using sliding windows is described in the next subsection.)
The churn is diminished and made more regular than in the
previous simulation such that 0.1% of nodes leave and join
every 10 cycles.

Figure 6. (a) Comparing the ordering algorithm and the ranking algorithm. (b) Comparing the
uniform drawing and the underlying variant of Cyclon. (c) Effect of a burst of attribute-correlated
churn. (d) Effect of a low and regular attribute-correlated churn.

The slope of the decrease diminishes, and the churn reduces
the number of nodes with a low attribute value while
increasing the number of nodes with a large attribute value.
This imbalance leads to a messy slice assignment, that is,
each node must quickly find its new slice to prevent the
SDM from increasing. In the ordering algorithm, the SDM
increases faster than in the ranking algorithm. Unlike the 
ordering algorithm, the ranking one keeps re-estimating the 
rank of each node depending on the attribute values present 
in the system. Since the churn increases the attribute values 
present in the system, nodes tend to receive more messages 
with higher attribute values and fewer messages with lower
attribute values, which turns out to keep the SDM low despite
churn. To conclude, the results show that when the
churn is related to the attribute (e.g., attribute represents the 
session duration, uptime of a node), then the ranking algo- 
rithm is better suited than the ordering algorithm. 



Sliding-window for limiting the SDM increase. In Figure 6(d),
the "sliding-window" curve presents a slightly
modified version of the ranking algorithm that limits the
SDM increase. In the ranking algorithm, upon reception
of a new message each node i immediately re-computes its
rank estimate and the slice it thinks it belongs to. Consequently,
messages received long ago have as much
importance as fresh messages in the estimate of i. The
drawback, as it appeared in Figure 6(d) of Section 4.5, is
that if the attribute values are correlated with churn, then
the precision of the algorithm might diminish.

To cope with this issue, upon reception of a message,
each node records a piece of information: whether the attribute
value received is larger or lower than its own.
This information is stored in a first-in first-out
buffer such that only the most recent values remain. (One
might consider this buffer as a sliding window.) Right after
having recorded this information, node i can re-compute its
rank estimate and its slice estimate based on the
information in the buffer. Consequently, this improvement
increases the algorithm's tolerance to change.
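
A minimal sketch of this sliding-window estimate follows, using a bounded `collections.deque` as the first-in first-out buffer. The class name and the window size are hypothetical; the text does not specify how large the buffer should be.

```python
from collections import deque


class SlidingWindowRank:
    """Rank estimate computed from the most recent comparisons only."""

    def __init__(self, attribute, window_size):
        self.attribute = attribute
        # Each entry is True if the received attribute value was lower than
        # ours; with maxlen set, the oldest entry is dropped automatically.
        self.window = deque(maxlen=window_size)

    def on_update(self, received_attribute):
        """Record the comparison, then return the refreshed rank estimate."""
        self.window.append(received_attribute < self.attribute)
        return self.rank()

    def rank(self):
        """Fraction of buffered comparisons in which the other value was lower."""
        if not self.window:
            return None  # no message received yet
        return sum(self.window) / len(self.window)
```

For example, a node with attribute 0.5 and a window of four entries that first receives four lower values has a rank estimate of 1.0; after four higher values arrive, the old comparisons have been evicted and the estimate is 0.0. The window size trades reactivity against variance: a short buffer follows changes quickly but yields a noisier estimate.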



6. Conclusion 

Allocating resources to applications and isolating capable
nodes require specific algorithms that partition the network
in a relevant way. The sorting algorithm [13] provided
a first attempt to "slice" the network.

In this paper, we first proposed the ordering algorithm 
that improves over this sorting algorithm. This improve- 
ment comes from a judicious choice of candidate nodes to 
swap values. Second, we showed that the existing global
disorder measure cannot indicate whether nodes find their
slice. We therefore defined the slice disorder measure to measure
how wrongly nodes estimate their slice. Using this new
measure, two problems related to the use of static random 
values arose. The first one refers to the fact that slice as- 
signment heavily depends on the degree of uniformity of 
the initial random value. The second one is related to the 
fact that the churn (or failures) might be correlated with the 
attribute, leading to a wrong slice assignment. 

Last but not least, we provided a ranking algorithm to
solve these problems. This algorithm minimizes the effect
of correlated churn on slice disorder and recovers efficiently
after a period of correlated churn. The convergence speed-up
of the ordering algorithm and the accuracy of the ranking
algorithm are demonstrated through theoretical analysis and
simulations.